Diamond prices is can be predicted according to its specifications such as carat, cut, color and clarity. Since x-y-z variables mostly defines carat they might be behaving as the same way as carat did. We can observe the correlation of each variable with price on the ggpairs plot. Some of the variables are categorical, in order to see the effect them on price can be seen after converting them into numeric variables. But plotting them as color identity in carat-price plot may give some clue about their behavior.
library(vtreat)
library(GGally)
library(gridExtra)
library(tidyverse)
set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
Lets have a look at train data:
glimpse(diamonds_train)
## Rows: 43,143
## Columns: 11
## $ carat <dbl> 0.21, 0.23, 0.29, 0.31, 0.24, 0.26, 0.22, 0.23, 0.30, 0.23…
## $ cut <ord> Premium, Good, Premium, Good, Very Good, Very Good, Fair, …
## $ color <ord> E, E, I, J, I, H, E, H, J, J, F, J, E, I, J, J, J, H, J, G…
## $ clarity <ord> SI1, VS1, VS2, SI2, VVS1, SI1, VS2, VS1, SI1, VS1, SI1, SI…
## $ depth <dbl> 59.8, 56.9, 62.4, 63.3, 62.3, 61.9, 65.1, 59.4, 64.0, 62.8…
## $ table <dbl> 61, 65, 58, 58, 57, 55, 61, 61, 55, 56, 61, 54, 62, 54, 54…
## $ price <int> 326, 327, 334, 335, 336, 337, 337, 338, 339, 340, 342, 344…
## $ x <dbl> 3.89, 4.05, 4.20, 4.34, 3.95, 4.07, 3.87, 4.00, 4.25, 3.93…
## $ y <dbl> 3.84, 4.07, 4.23, 4.35, 3.98, 4.11, 3.78, 4.05, 4.28, 3.90…
## $ z <dbl> 2.31, 2.31, 2.63, 2.75, 2.47, 2.53, 2.49, 2.39, 2.73, 2.46…
## $ diamond_id <int> 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 2…
summary(diamonds_train)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1285 D:5416 SI1 :10449 Min. :43.00
## 1st Qu.:0.4000 Good : 3923 E:7835 VS2 : 9806 1st Qu.:61.00
## Median :0.7000 Very Good: 9662 F:7629 SI2 : 7354 Median :61.80
## Mean :0.7985 Premium :11036 G:9037 VS1 : 6538 Mean :61.75
## 3rd Qu.:1.0400 Ideal :17237 H:6646 VVS2 : 4052 3rd Qu.:62.50
## Max. :5.0100 I:4336 VVS1 : 2923 Max. :79.00
## J:2244 (Other): 2021
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2403 Median : 5.700 Median : 5.710
## Mean :57.45 Mean : 3939 Mean : 5.732 Mean : 5.736
## 3rd Qu.:59.00 3rd Qu.: 5352 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18818 Max. :10.740 Max. :58.900
##
## z diamond_id
## Min. : 0.000 Min. : 2
## 1st Qu.: 2.910 1st Qu.:13586
## Median : 3.520 Median :26991
## Mean : 3.539 Mean :26995
## 3rd Qu.: 4.040 3rd Qu.:40454
## Max. :31.800 Max. :53940
##
diamonds_train %>% summarise_all(funs(sum(is.na(.)))) # is there any na values?
## # A tibble: 1 x 11
## carat cut color clarity depth table price x y z diamond_id
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 0 0 0
There is no missing values in the data. ### Exploring distribution of variables Since there are many variables, I will use the scatter plot matrix to get a quick and easy view of the distribution and correlation of different variables.
In order to keep it simple first investigate quantitative variables.
ggpairs(diamonds_train, columns = c(1,5,6,7), aes(alpha =0.5), title = 'Quantitative Variables vs Price')
ggpairs(diamonds_train, columns = c(2,3,4,7,8,9,10), aes(alpha =0.5),title = 'Qualitative Variables vs Price')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now we will take a detailed look at some of the relationships. I will use scaling for Price.
p1 <- ggplot(diamonds_train, aes(x = carat, y = price, color = cut)) +
geom_point(alpha = 0.3) + scale_y_sqrt()
p2 <- ggplot(diamonds_train, aes(x = carat, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt() #to see how it doing with square root.
p3 <- ggplot(diamonds_train, aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.3) + scale_y_sqrt()
p4 <- ggplot(diamonds_train, aes(x = x, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt()
p5 <- ggplot(diamonds_train, aes(x = y, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt()
p6 <- ggplot(diamonds_train, aes(x = z, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt()
grid.arrange(p1, p2, p3, p4, p5, p6, nrow= 3, ncol = 2)
First we look the price and carat relationship since it has most correlation. While doing this we add cut variable as color argument in the plot and take the root of price.
p1 <- ggplot(diamonds_train, aes(x = carat, y = price, color=cut)) +
geom_point(alpha = 0.3) + scale_y_sqrt()
p1
p2 <- ggplot(diamonds_train, aes(x = carat, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt() #to see how it doing with square root.
p2
p3 <- ggplot(diamonds_train, aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.3) + scale_y_sqrt()
p3
p4 <- ggplot(diamonds_train, aes(x = x, y = price, color = color)) + geom_point(alpha = 0.3) + scale_y_sqrt()
p4
Firstly we try default value of price in model1:
model1 <- lm(price ~ ., data=diamonds_train[1:10])
summary(model1)
##
## Call:
## lm(formula = price ~ ., data = diamonds_train[1:10])
##
## Residuals:
## Min 1Q Median 3Q Max
## -21446.0 -593.0 -182.4 378.8 10701.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6032.795 440.677 13.690 < 2e-16 ***
## carat 11306.047 55.032 205.445 < 2e-16 ***
## cut.L 587.057 25.144 23.347 < 2e-16 ***
## cut.Q -300.964 20.114 -14.963 < 2e-16 ***
## cut.C 148.352 17.317 8.567 < 2e-16 ***
## cut^4 -26.568 13.826 -1.922 0.0547 .
## color.L -1949.838 19.404 -100.484 < 2e-16 ***
## color.Q -672.931 17.640 -38.148 < 2e-16 ***
## color.C -161.554 16.458 -9.816 < 2e-16 ***
## color^4 25.973 15.117 1.718 0.0858 .
## color^5 -98.819 14.278 -6.921 4.55e-12 ***
## color^6 -59.129 12.977 -4.556 5.22e-06 ***
## clarity.L 4129.825 33.880 121.897 < 2e-16 ***
## clarity.Q -1954.978 31.602 -61.863 < 2e-16 ***
## clarity.C 997.867 27.030 36.918 < 2e-16 ***
## clarity^4 -382.021 21.572 -17.709 < 2e-16 ***
## clarity^5 242.941 17.611 13.795 < 2e-16 ***
## clarity^6 12.562 15.328 0.820 0.4125
## clarity^7 87.466 13.525 6.467 1.01e-10 ***
## depth -65.123 4.992 -13.045 < 2e-16 ***
## table -29.084 3.250 -8.950 < 2e-16 ***
## x -1020.664 34.775 -29.350 < 2e-16 ***
## y -1.438 19.396 -0.074 0.9409
## z -38.006 33.858 -1.123 0.2616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 43119 degrees of freedom
## Multiple R-squared: 0.9202, Adjusted R-squared: 0.9202
## F-statistic: 2.162e+04 on 23 and 43119 DF, p-value: < 2.2e-16
Then if we take log of price it seems more straight and suitable for regression:
model2 <- lm(I(log(price)) ~ ., data = diamonds_train[1:10])
summary(model2)
##
## Call:
## lm(formula = I(log(price)) ~ ., data = diamonds_train[1:10])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3119 -0.0899 0.0008 0.0894 8.8470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7768048 0.0666391 -41.669 < 2e-16 ***
## carat -0.6806057 0.0083219 -81.785 < 2e-16 ***
## cut.L 0.1047956 0.0038023 27.561 < 2e-16 ***
## cut.Q -0.0368615 0.0030416 -12.119 < 2e-16 ***
## cut.C 0.0364380 0.0026187 13.915 < 2e-16 ***
## cut^4 0.0125579 0.0020907 6.007 1.91e-09 ***
## color.L -0.4518299 0.0029343 -153.981 < 2e-16 ***
## color.Q -0.1029256 0.0026675 -38.585 < 2e-16 ***
## color.C -0.0123329 0.0024888 -4.955 7.25e-07 ***
## color^4 0.0175279 0.0022860 7.667 1.79e-14 ***
## color^5 -0.0081027 0.0021592 -3.753 0.000175 ***
## color^6 0.0020423 0.0019624 1.041 0.298013
## clarity.L 0.8926304 0.0051233 174.230 < 2e-16 ***
## clarity.Q -0.2572644 0.0047788 -53.834 < 2e-16 ***
## clarity.C 0.1421625 0.0040874 34.781 < 2e-16 ***
## clarity^4 -0.0668565 0.0032621 -20.495 < 2e-16 ***
## clarity^5 0.0287787 0.0026632 10.806 < 2e-16 ***
## clarity^6 -0.0039185 0.0023179 -1.691 0.090933 .
## clarity^7 0.0284550 0.0020452 13.913 < 2e-16 ***
## depth 0.0537834 0.0007549 71.241 < 2e-16 ***
## table 0.0092119 0.0004914 18.745 < 2e-16 ***
## x 1.1979141 0.0052587 227.795 < 2e-16 ***
## y 0.0306626 0.0029331 10.454 < 2e-16 ***
## z 0.0388247 0.0051200 7.583 3.44e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1708 on 43119 degrees of freedom
## Multiple R-squared: 0.9717, Adjusted R-squared: 0.9717
## F-statistic: 6.445e+04 on 23 and 43119 DF, p-value: < 2.2e-16
new_testdf <- diamonds_test %>% select(-price,-diamond_id)
train_pred= predict(model1)
test_pred= predict(model1, new=new_testdf)
diamonds_test2 <- diamonds_test %>% select(-price,-diamond_id)
train_pred= predict(model2)
test_pred2= predict(model2, new=diamonds_test2)